Impact of System and Cache Bandwidth on Stencil Computations Across Multiple Processor Generations

Authors

Robert Strzodka

Mohammed Shaheen

Dawid Pająk

Abstract

We compare old single-core multi-processor systems against multi-core processors and study which improvements are most relevant for increasing the performance of stencil computations. Even before the multi-core era began, the bandwidth wall, that is, the discrepancy between off-chip bandwidth requirements and system bandwidth performance, was already a significant problem. Because of the growing number of parallel cores in CPUs, this discrepancy could only be kept from deteriorating further by introducing dual-, triple- and quad-channel memory interfaces. However, this type of off-chip bandwidth scaling is too expensive and thus only a temporary relief that cannot keep up indefinitely with the exponentially growing number of cores. Therefore, we analyze in particular how the scaling of system and cache bandwidth affects the performance of stencil computations. We evaluate both naive stencil implementations and time skewing variants that exploit temporal locality and minimize the number of cache misses in iterative stencil computations. We prove certain invariance properties of the schemes and develop a corresponding performance model. Then, we use this model to find out which hardware improvements in the old single-core processors are necessary to match the performance of the new multi-core processors. From this we can draw conclusions about the most effective improvements for future processors.
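To make the bandwidth argument concrete, here is a minimal sketch of a naive iterative stencil, assuming a 1D three-point averaging stencil with fixed boundary values. The function names and the traffic estimate are illustrative assumptions, not the implementation or the performance model from the paper. The point is that the naive scheme sweeps the whole grid once per time step, so a grid larger than the cache is streamed from main memory again in every step, whereas schemes that exploit temporal locality aim to reuse cached values across several steps.

```python
def jacobi_step(u):
    """One sweep of a 3-point averaging stencil; boundary values stay fixed."""
    out = list(u)
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out


def jacobi_naive(u, steps):
    """Naive iteration: every time step traverses the entire grid."""
    for _ in range(steps):
        u = jacobi_step(u)
    return u


def naive_traffic_bytes(n, steps, bytes_per_value=8):
    """Rough estimate (an assumption, not the paper's model) of main-memory
    traffic when the grid does not fit in cache: each step reads and writes
    every value once. Cache-line and write-allocate effects are ignored."""
    return 2 * n * steps * bytes_per_value
```

Under this simple estimate, traffic grows linearly with the number of time steps, which is why the naive scheme tends to be limited by system bandwidth rather than by arithmetic throughput.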

Similar resources

Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors

Stencil-based kernels constitute the core of many important scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. In this paper, we explore the impact of trends in memory subsystems on a variety of stencil optimization techniques and develop performance mod...

A 3D-Stacked Memory Manycore Stencil Accelerator System

Stencil operations are an important class of scientific computational kernels that are pervasive in scientific simulations as well as in image processing. A key characteristic of this class of computations is their low operational intensity, i.e., the ratio of the number of memory accesses to the number of floating-point operations performed is high. As a result, the performance of ...

An Auto-tuning JIT Compiler for Accelerating Multiple Stencil Computations

We present a JIT compiler with auto-tuning capabilities that fuses multiple stencil computations. Data arrays for scientific computing or image processing often exceed cache-memory size. To take advantage of spatial and temporal locality, a common method is to partition the images into tiling blocks for multicore architectures. In realistic scenarios, the multiple image algorithms, most of which ar...

Overcoming Bandwidth Limitations in Visual Computing

Because visual computations are very data-intensive, they are often limited by the bandwidth of the system rather than its peak computational performance. The trend towards many-core architectures exacerbates the problem because the parallel cores let the compute capability grow exponentially while the system bandwidth increases only linearly. At the core of the bandwidth problem in visual compu...

GPU-UniCache: Automatic Code Generation of Spatial Blocking for Stencils on GPUs

Spatial blocking is a critical memory-access optimization to efficiently exploit the computing resources of parallel processors, such as many-core GPUs. By reusing cache-loaded data over multiple spatial iterations, spatial blocking can significantly lessen the pressure of accessing slow global memory. Stencil computations, for example, can exploit such data reuse via spatial blocking through t...

Publication date: 2011
